# Study on FPGA SEU Mitigation for the Readout Electronics of DAMPE BGO Calorimeter in Space

Zhongtao Shen, Changqing Feng, Shanshan Gao, Deliang Zhang, Di Jiang, Shubin Liu, and Qi An

*Abstract—***The BGO calorimeter, which provides a wide measurement range of the primary cosmic ray spectrum, is a key sub-detector of the Dark Matter Particle Explorer (DAMPE). The readout electronics of the calorimeter uses 16 Actel ProASIC Plus flash-based field-programmable gate arrays (FPGAs), whose design-level flip-flops and embedded block random access memories (RAMs) are sensitive to single event upset (SEU) in the harsh space environment. To comply with radiation hardness assurance (RHA), SEU mitigation methods, including partial triple modular redundancy (TMR), CRC checksum, and multi-domain reset, are analyzed and verified in a heavy-ion beam test. Built on multi-level redundancy, an FPGA design that combines SEU tolerance with low resource consumption is implemented for the readout electronics.**

*Index Terms—***FPGA, SEU mitigation.**

## I. INTRODUCTION

The Dark Matter Particle Explorer (DAMPE) is being constructed as a scientific satellite to search for evidence of the existence of dark matter in space. As shown in Fig. 1, the DAMPE consists of four sub-detectors: a plastic scintillator detector (PSD), a silicon tracker (STK), a BGO calorimeter (BGO), and a neutron detector (ND) [1], [2]. In addition, a trigger board provides a trigger signal to the four sub-detectors, and a controlling computer is in charge of controlling and communicating with them. As the satellite is designed to fly in a near-earth orbit at an altitude of 500 km for more than three years, the radiation damage that high-energy particles in the space environment cause to semiconductors is one of the main concerns for the reliability of the space electronics.

The task of the BGO calorimeter is the observation of high-energy electrons/positrons and gamma rays [3]. The readout electronics system of the BGO consists of 16 front-end electronics (FEE) boards. On each FEE board, a field-programmable gate array (FPGA) works as the controlling chip, in charge of handling the commands sent to the FEE, controlling the work status of other chips on the FEE board, and acquiring, packaging, and sending the scientific data and engineering parameters. Two kinds of Actel flash-based FPGAs, the ProASIC Plus APA300 (APA300) with 300,000 system gates and the ProASIC Plus APA600 (APA600) with 600,000 system gates, are chosen as the BGO FEE controlling chips. The two types of chips are the same except for the amount of resources.

Manuscript received June 16, 2014; revised December 12, 2014; accepted April 18, 2015. Date of publication May 21, 2015; date of current version June 12, 2015. This work was supported by the Strategic Priority Research Program on Space Science of the Chinese Academy of Sciences (Grant XDA04040202-4), the CAS Center for Excellence in Particle Physics (CCEPP) and the National Basic Research Program (973 Program) of China (Grant 2010CB833002).

S. Liu is with the State Key Laboratory of Particle Detection and Electronics, University of Science and Technology of China, Hefei, Anhui, China (e-mail: liushb@ustc.edu.cn).

Z. Shen, C. Feng, S. Gao, D. Zhang, D. Jiang, and Q. An are with the Department of Modern Physics, University of Science and Technology of China, Hefei, Anhui, China.

Digital Object Identifier 10.1109/TNS.2015.2427293

Fig. 1. DAMPE detector cross section. The DAMPE consists of PSD, STK, BGO, and ND.

Fig. 2. Structure of APA FPGA [4].

The APA family, which adopts a 0.22 $\mu$m LVCMOS process with four layers of metal, has abundant programmable resources such as logic tiles, global nets, embedded random access memories (RAM), and I/Os, as shown in Fig. 2. It uses a live-at-power-up in-system programming (ISP) flash switch as its programming element [4].

0018-9499 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications\_standards/publications/rights/index.html for more information.



Fig. 3. Structure of TMR. Three replicas for one memory cell are used and a voter identifies the correct result among the three on the basis of a majority vote.

In the switch, two transistors share the floating gate, which stores the programming information. The upset mechanism for a heavy ion is to discharge the floating gate by generating charge in the bottom and top oxides that diffuse to the floating gate. However, the amount of charge generated by an ion with linear energy transfer (LET) value of 37 MeV $\cdot$ cm<sup>2</sup>/mg is less than 1% of the total charge on a programmed floating gate [5]. Therefore, the configuration unit is insensitive to single event effect (SEE) and the programming information stored in logic tiles and I/Os is unlikely to be changed when the chip works in space.

A total ionizing dose (TID) test for the FEE board with APA600 was conducted at the University of Science and Technology of China (USTC), using a $^{60}$Co gamma source. The whole FEE board was irradiated up to 10 krad(Si), which meets the TID requirement for three years in space. No evident degradation of the FPGA was found, and all functions of the FPGA worked well during the experiment.

Experiments have shown that the D-type flip-flops (DFFs) configured from the logic tiles and the embedded RAMs are sensitive to SEU. The test results show that the SEU LET threshold for the DFF is less than 3 MeV $\cdot$ cm<sup>2</sup>/mg. Based on this value, CREME96 predicts that the SEU probability for the chip is about $6.8 \times 10^{-7}$ bit<sup>-1</sup> $\cdot$ day<sup>-1</sup> at the altitude of 500 km [5], [6], [7], [8]. Considering that there are about 3,000 memory units in the APA chip and that a single-bit error can cause a software error or even an infinite loop, measures to mitigate SEU are necessary when the ProASIC Plus device is used in space.
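To give a feel for the scale of the problem, the quoted rate can be turned into an expected number of upsets. The sketch below is only a back-of-the-envelope estimate: it assumes that the "about 3,000 memory units" can be treated as 3,000 independent bits and that the rate is constant over a three-year mission; the function and constant names are illustrative.

```python
# Illustrative estimate only; the inputs are the figures quoted in the text.
SEU_RATE_PER_BIT_DAY = 6.8e-7   # CREME96 prediction at 500 km
N_BITS = 3000                   # "about 3,000 memory units", treated as bits (assumption)
MISSION_DAYS = 3 * 365          # three-year mission, leap days ignored


def expected_upsets(rate_per_bit_day: float, n_bits: int, days: float) -> float:
    """Expected upset count, assuming independent bits and a constant rate."""
    return rate_per_bit_day * n_bits * days


upsets_per_day = expected_upsets(SEU_RATE_PER_BIT_DAY, N_BITS, 1)
upsets_mission = expected_upsets(SEU_RATE_PER_BIT_DAY, N_BITS, MISSION_DAYS)
print(f"{upsets_per_day:.2e} upsets/day, {upsets_mission:.2f} over the mission")
```

Under these assumptions the estimate is on the order of a couple of upsets per chip over the mission, which is why an unprotected design cannot be expected to run error-free.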

## II. SEU MITIGATION TECHNIQUES

### *A. Triple Modular Redundancy*

The triple modular redundancy (TMR) technique is the most commonly adopted passive hardware redundancy technique achieving fault masking properties [9]. It uses three replicas for one memory cell and adds a voter that identifies the correct result among the three ones on the basis of a majority vote, as shown in Fig. 3. The technique may be applied at different levels, from the whole system to a single register.
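As a minimal sketch of the voting step, the majority function can be written as a bitwise 2-of-3 vote over three copies of a register word. The function name and the example values are illustrative and are not taken from the FEE design.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)


stored = 0b1011_0010
upset = stored ^ 0b0000_1000      # one replica with a single flipped bit

# The voter masks the fault as long as only one replica is wrong per bit.
voted = majority_vote(stored, upset, stored)
print(f"voted = {voted:#010b}")
```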

The TMR technique improves the reliability of the system, but it also brings costs: much higher resource consumption, lower speed, higher power, and more difficult placement and routing. To balance SEU tolerance against engineering cost, TMR is usually applied only to key registers and RAMs.



Fig. 4. Structure of FPGA of BGO FEE. The control part and the scientific data acquisition part implement the main function of the logic and take up about 78% of all logic resource consumption; system settings saved in the registers and RAMs in the status part and the monitor part are used for real-time monitoring.

### *B. Error Detection and Correction*

Error detection and correction codes (EDAC) are often used to improve the reliability of data storage media. The general idea for achieving error detection and correction is to add some redundancy. The redundancy is a fixed number of check bits, which are derived from the data bits by some special algorithm. The error detection schemes include parity bits, Hamming code, cyclic redundancy check, and others.

## III. SEU MITIGATION APPLIED IN FEE FPGA DESIGN

### *A. Structure of FPGA of BGO FEE*

The FPGA of BGO FEE mainly consists of four parts: the scientific data acquisition part, the control part, the monitor part, and the status manager part, as shown in Fig. 4. The scientific data acquisition part communicates with the peripheral chips related to data acquisition, caches the data obtained from the peripheral devices, packages the scientific data, and sends the data package. The control part receives the command sent to the FPGA, verifies the validity of the command, executes the command, and gives the response. These two parts, which implement the main function of the FPGA, take up about 78% of all logic resource consumption. The status manager part holds key registers and RAMs that set the operating status of the system and need to be valid all the time. Other parts get their operating settings from the status manager part. The monitor part provides parameters describing the operating status for real-time monitoring.

When a command comes from the controlling computer, the control part begins to handle it. When a trigger signal comes from the trigger board, it is handled by the scientific data acquisition part. Both the command handling procedure and the scientific data acquisition procedure consist of many steps. When either procedure is finished, the corresponding part of the logic is held in reset until the next command or trigger signal arrives.



Fig. 5. Multi-domain reset signal. Hw\_rstn is the global hardware reset signal; soft\_rstn is the global software reset signal; cmd\_path\_rstn and sci\_path\_rstn are used to reset the control part and the scientific data acquisition part, respectively.

### *B. Multi-Domain Reset and Multi-Level Reset*

As mentioned before, the logic has a command handling procedure and a scientific data acquisition procedure. In the FPGA logic, each procedure is implemented by a finite-state machine (FSM). The FSM controls the peripheral devices and needs the feedback signals from peripheral devices for state transformation. In space, SEU happening in the FSM or loss of peripheral device signals may lead the control part or the scientific data acquisition part into an infinite loop status, at which the system cannot work properly and needs to be reset. However, it is not necessary to reset the whole system if only parts of the logic are under infinite loop condition. Therefore, multi-domain reset and multi-level reset are adopted in FPGA design.

As shown in Fig. 5, four reset signals are used in the logic design: hw\_rstn, soft\_rstn, cmd\_path\_rstn, and sci\_path\_rstn. Hw\_rstn is the hardware reset signal which comes from the external reset chip. It resets all registers and RAMs to their default values and is used to initialize the FPGA when powering on. Soft\_rstn comes from the reset command and resets all registers and RAMs except a shifter which is used in command reset. Cmd\_path\_rstn and sci\_path\_rstn are used to reset the control part and the scientific data acquisition part, respectively.

When a data acquisition process begins, a timer starts to count. If the data acquisition process does not finish within 1 ms, which is about 1.5 times the duration of a normal scientific data acquisition procedure, sci\_path\_rstn becomes active and resets the scientific data acquisition part automatically. Cmd\_path\_rstn works in a similar way. These two reset signals prevent the system from staying in an infinite loop and guarantee that an SEU happening in one process does not influence the next one. In addition, when the scientific data acquisition part and the control part are idle, the reset signals are active and hold the two parts in reset, which effectively avoids the influence of SEU.
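The timeout behaviour described above can be sketched as a small cycle-level model. This is a behavioural illustration, not the FEE logic itself: the class and method names are hypothetical, and it assumes the timer counts cycles of the 20 MHz work clock, so that 1 ms corresponds to 20,000 cycles.

```python
CLOCK_HZ = 20_000_000                 # assumed timer clock: the 20 MHz work frequency
TIMEOUT_CYCLES = CLOCK_HZ // 1000     # 1 ms timeout expressed in clock cycles


class PathWatchdog:
    """Models one per-domain reset such as sci_path_rstn (active low)."""

    def __init__(self, timeout_cycles: int) -> None:
        self.timeout_cycles = timeout_cycles
        self.busy = False
        self.counter = 0

    def start(self) -> None:
        """A trigger or command arrives: release reset and start the timer."""
        self.busy = True
        self.counter = 0

    def tick(self, done: bool) -> int:
        """Advance one clock cycle; return rstn (0 means the domain is in reset)."""
        if not self.busy:
            return 0                  # idle domains are held in reset
        self.counter += 1
        if done or self.counter >= self.timeout_cycles:
            self.busy = False         # normal completion or timeout
            return 0
        return 1


# Simulate an FSM that hangs and never reports completion.
wd = PathWatchdog(TIMEOUT_CYCLES)
wd.start()
ticks = 0
while True:
    ticks += 1
    if wd.tick(done=False) == 0:
        break
print(f"hung FSM reset after {ticks} cycles")
```

In this model a hung procedure is cut off after 20,000 cycles, and the domain then stays in reset until the next `start()`, which mirrors the idle-in-reset behaviour described in the text.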

### *C. TMR and CRC*

As mentioned before, most of the registers in the logic belong to the control part or the scientific data acquisition part. They are reset when the procedure is done. However, in the status part, there are some key registers and RAMs which are used as system operating settings and need to be valid all the time. The TMR technique is used in these vital registers and RAMs to make sure that their values are correct.



Fig. 6. TMR structure of va\_cfgreg.

Fig. 6 shows the TMR structure of va\_cfgreg, which is one of the key registers in the FPGA. In the structure, each replica is a DFF and the voter consists of combinational logic circuits. The input of each DFF is connected to a 2:1 multiplexer, which selects either the outside signal or the voter result as the input of the DFF. When the register does not need to be rewritten, the voter result is selected as the input of all three DFFs, and their values are refreshed by the voter result at each rising edge of the clock signal. Therefore, if an SEU happens in one DFF, the voter result, which is decided by the majority of the three DFFs, is still correct, and the wrong value is overwritten by the correct value at the next rising clock edge.
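The refresh mechanism can be illustrated with a small behavioural model. The sketch below is not the va\_cfgreg implementation; the class `TmrRegister` and its methods are hypothetical names, and each replica is modelled as an integer word rather than a single DFF.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-of-3 majority vote."""
    return (a & b) | (a & c) | (b & c)


class TmrRegister:
    """Three replicas, a voter, and a 2:1 mux in front of each replica."""

    def __init__(self, value: int = 0) -> None:
        self.replicas = [value, value, value]

    @property
    def q(self) -> int:
        """Voter output, computed combinationally from the three replicas."""
        return majority_vote(*self.replicas)

    def clock(self, write_enable: bool = False, data: int = 0) -> None:
        """Rising clock edge: load outside data, or write back the voter result."""
        next_value = data if write_enable else self.q
        self.replicas = [next_value, next_value, next_value]

    def inject_seu(self, replica: int, bit: int) -> None:
        """Flip one bit in one replica to emulate a single event upset."""
        self.replicas[replica] ^= 1 << bit


reg = TmrRegister()
reg.clock(write_enable=True, data=0xA5)   # normal write
reg.inject_seu(replica=1, bit=3)          # upset one replica
masked = reg.q                            # voter still returns 0xA5
reg.clock()                               # write-back repairs the replica
print(hex(masked), [hex(r) for r in reg.replicas])
```

Because the corrupted replica is overwritten on the next clock edge, a second upset in another replica would have to arrive within the same clock period to defeat the voter.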

To prevent a multiple bit upset (MBU) from affecting two or three replicas of one TMR unit, the three replicas are manually placed in different physical areas before automatic placement and routing.

To further ensure the correctness of the values in key registers, the values are monitored as engineering parameters all the time. However, the RAMs hold far more data than the registers, so their contents cannot be monitored in real time. Therefore, the cyclic redundancy check (CRC) is adopted to automatically detect data corruption in RAMs.




Fig. 7. SEU test site. The test was performed at HIRFL-TR5 terminal, using Krypton ions. Irradiations were conducted in air, at ambient room temperature, with heavy ions passing through a vacuum/air transition foil. The LET values of the ions could be adjusted from 22.7 to 39.9 MeV  $\cdot$  cm<sup>2</sup>/mg.

The CRC result is calculated when the data are written and is attached to the end of the RAM. When the data in a RAM are used, the CRC result is calculated again and compared with the one stored at the end of the RAM. A mismatch between the two CRC results indicates data corruption in the RAM, and an indicator bit is set automatically. This bit is also monitored as an engineering parameter and serves as a request for RAM reconfiguration.
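The store-then-recheck scheme can be sketched as follows. The paper does not state which CRC polynomial or width is used, so the example assumes CRC-32 from Python's standard `zlib` module purely for illustration; the function names and the 64-byte test pattern are hypothetical.

```python
import zlib

CRC_BYTES = 4  # CRC-32 is an assumption; the polynomial used in the FEE is not given


def store_with_crc(data: bytes) -> bytes:
    """Return the RAM image: payload followed by its CRC."""
    return data + zlib.crc32(data).to_bytes(CRC_BYTES, "big")


def ram_corrupted(image: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the stored value.

    Returns True when they differ, which models the indicator bit being set.
    """
    data, stored = image[:-CRC_BYTES], int.from_bytes(image[-CRC_BYTES:], "big")
    return zlib.crc32(data) != stored


image = bytearray(store_with_crc(bytes(range(64))))
flag_before = ram_corrupted(bytes(image))   # clean image
image[10] ^= 0x04                           # emulate a single-bit upset in RAM
flag_after = ram_corrupted(bytes(image))    # corruption is detected
print(flag_before, flag_after)
```

A CRC only detects the corruption; recovering the data still relies on rewriting the RAM, which is why the flag is used as a request for reconfiguration.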

## IV. TEST AND RESULTS

After these measures were applied to the logic design, an ion-beam test was performed at the Heavy Ion Research Facility in Lanzhou (HIRFL) cyclotrons to evaluate the SEU tolerance of the FPGA logic, as shown in Fig. 7.

### *A. Test Platform*

The test system consists of a host PC, a Master board, and a device under test (DUT) board. A LabVIEW program runs on the host PC to control the whole system and monitor the status of the APA. The Master board, which works outside the radiation area, is in charge of communicating with the host PC and powering the DUT board. The APA chip is placed on the DUT board, and only the DUT board is exposed to radiation. With this host PC, Master board, and DUT board architecture, the host PC and the Master board work without being influenced by radiation.

### *B. Irradiation Test with High LET Value*

The test was performed at HIRFL-TR5 terminal, using Bismuth ions. Irradiations were conducted in air, at ambient temperature, with heavy ions passing through a vacuum/air transition foil. By changing the thickness of air, the LET value was adjusted to about 90 MeV  $\cdot$  cm<sup>2</sup>/mg which is much higher than what the chip will be exposed to in space. An APA600 chip was configured with a simple logic and tested under the radiation condition. Because the Bismuth ions cannot go through the package lid of the APA600 chip, the package lid was removed.

The currents of the logic array power supply and the I/O pad power supply of the FPGA were monitored by a monitor circuit on the Master board with a precision of 0.1 mA.

The APA600 operated under irradiation for about 10 minutes, and the total fluence reached $8.4 \times 10^6$ ions/cm<sup>2</sup>. During this time, the supply current of the FPGA was monitored and no abnormal current was found, which means that no single event latch-up (SEL) occurred in the experiment and indicates that the APA chips are insensitive to SEL. The FPGA worked well during the experiment and no configuration corruption was observed, which indicates that the flash switch structure in the APA is immune to SEU and that the configuration information in the APA will not be changed when it works in space. The experiment results are consistent with the research done by Allen and Swift [5].

### *C. Functional Test*

The test was performed at the HIRFL-TR5 terminal, using Krypton ions. Because the DUT board has no peripheral devices that participate in data acquisition and other FEE functions, the logic parts communicating with these devices were removed and the science data package was simply filled with "0x55AA". With these changes, the resource consumption was reduced by about 10%. Configured with this lite version of the logic, the APA300 was irradiated at an LET value of 39.6 MeV $\cdot$ cm<sup>2</sup>/mg and an ion flux of 100 ions/cm<sup>2</sup>/s. The experiment continued for about 25 minutes and the FPGA worked without errors. The control part handled all received commands without any error, and the science data acquisition part also worked well.

The experiment shows that the three SEU mitigation methods used in the logic design are effective.

## V. DISCUSSION

### *A. Reliability of TMR with Correction*

Using the Markov model in fault-tolerant computing, a system's reliability can be predicted [10]. According to the study of [10], if reliability is defined as the probability that the value in a single DFF register is still correct after a certain time, then

$$
R_{\text{DFF1}}\left(t\right) = e^{\left(-\lambda t\right)}\tag{1}
$$

where  $\lambda$  represents the rate that the value in a DFF transitions from correctness to error per unit time and  $t$  represents time.

For a DFF protected by TMR with correction, the reliability becomes

$$
R_{\text{DFF2}}\left(t\right) = e^{-\frac{1}{2}\left(\mu + 5\lambda\right)t}\left[\frac{\left(\mu + 5\lambda\right)\sinh\left(\frac{1}{2}t\sqrt{\mu^2 + 10\lambda\mu + \lambda^2}\right)}{\sqrt{\mu^2 + 10\lambda\mu + \lambda^2}} + \cosh\left(\frac{1}{2}t\sqrt{\mu^2 + 10\lambda\mu + \lambda^2}\right)\right]\tag{2}
$$

where  $\lambda$  represents the rate that the value in a DFF transitions from correctness to error per unit time,  $\mu$  represents the repair rate per unit time, and  $t$  represents time.

Considering that there are 3,000 key registers in the FPGA logic, if system reliability is defined as the probability that all registers in the system are correct, then

$$
R_{\text{system}}\left(t\right) = \left[R_{\text{DFF}}\left(t\right)\right]^{3000}.\tag{3}
$$

Fig. 8. System reliability changes with time.  $\lambda$ , which represents the rate that the value in a DFF transitions from correctness to error per unit time, is assumed to be  $10^{-7}$  hour<sup>-1</sup>.  $\mu$ , which represents the repair rate per unit time, is assumed to be 1 hour<sup>-1</sup> in the architecture of TMR DFF.

CREME96 predicts an SEU probability of about $6.8 \times 10^{-7}$ bit<sup>-1</sup> · day<sup>-1</sup> for the APA chip at the altitude of 500 km [6], [7], [8], which corresponds to about $2.8 \times 10^{-8}$ bit<sup>-1</sup> · hour<sup>-1</sup>, so $\lambda$ is conservatively assumed to be $10^{-7}$ hour<sup>-1</sup>. Because there is a write-back line in the TMR structure of the design, $\mu$ is taken to be very close to 1 hour<sup>-1</sup>. Combining these two parameters with (1), (2), and (3), the system reliability as a function of time can be derived, and the results are shown in Fig. 8. Compared with that of the system without TMR, the reliability of the system with TMR hardly changes with time and is still very close to 1 after 30,000 hours. Therefore, the TMR structure with a write-back line greatly improves the reliability of a system.
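The curves can be reproduced numerically from (1), (2), and (3). The sketch below is one possible way to do so and is not the authors' code. It rewrites (2) as a sum of two decaying exponentials, which is algebraically identical but avoids overflow of `sinh` and `cosh` at large $t$; the function names are illustrative.

```python
import math

LAMBDA = 1e-7    # upset rate per DFF, hour^-1 (value assumed in the text)
MU = 1.0         # repair rate, hour^-1 (value assumed in the text)
N_REGS = 3000    # number of key registers


def r_dff1(t: float, lam: float = LAMBDA) -> float:
    """Eq. (1): reliability of a single unprotected DFF."""
    return math.exp(-lam * t)


def r_dff2(t: float, lam: float = LAMBDA, mu: float = MU) -> float:
    """Eq. (2): TMR DFF with repair, in an overflow-free exponential form."""
    s = mu + 5.0 * lam
    w = math.sqrt(mu * mu + 10.0 * lam * mu + lam * lam)
    # (s - w) / 2 computed as 12*lam^2 / (s + w), since s^2 - w^2 = 24*lam^2.
    slow = 12.0 * lam * lam / (s + w)
    fast = 0.5 * (s + w)
    return (0.5 * (1.0 + s / w) * math.exp(-slow * t)
            + 0.5 * (1.0 - s / w) * math.exp(-fast * t))


def r_system(t: float, r_dff, n: int = N_REGS) -> float:
    """Eq. (3): probability that all n registers are still correct."""
    return r_dff(t) ** n


for hours in (0.0, 10_000.0, 30_000.0):
    print(f"t = {hours:>8.0f} h  without TMR: {r_system(hours, r_dff1):.3e}  "
          f"with TMR: {r_system(hours, r_dff2):.7f}")
```

With these parameters the unprotected system reliability falls to roughly $10^{-4}$ at 30,000 hours, while the TMR value stays above 0.9999, in line with the behaviour described for Fig. 8.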

Using the system reliability, the mean time between failures (MTBF) of the system can be calculated. As shown in (4) and (5), the MTBF of the system improves by a factor of about $3.3 \times 10^5$ when the TMR structure with a write-back line is used.

$$
\text{MTBF}_{\text{WithoutTMR}} = \frac{\int_0^\infty t\,\frac{d\left(1 - R_{\text{DFF1}}^{3000}\right)}{dt}\,dt}{\int_0^\infty \frac{d\left(1 - R_{\text{DFF1}}^{3000}\right)}{dt}\,dt} \approx 3000\ \text{hours};\tag{4}
$$

$$
\text{MTBF}_{\text{WithTMR}} = \frac{\int_0^\infty t\,\frac{d\left(1 - R_{\text{DFF2}}^{3000}\right)}{dt}\,dt}{\int_0^\infty \frac{d\left(1 - R_{\text{DFF2}}^{3000}\right)}{dt}\,dt} \approx 10^9\ \text{hours}.\tag{5}
$$
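As a rough cross-check of the orders of magnitude, the two MTBF values can be estimated in closed form. This is a sketch under stated simplifications, not the calculation used in the paper: it uses the fact that the ratio of integrals equals $\int_0^\infty R_{\text{system}}(t)\,dt$ when the reliability decays to zero, and for the TMR case it keeps only the slowly decaying exponential of (2). Function names are illustrative.

```python
import math

LAMBDA = 1e-7    # hour^-1, as assumed in the text
MU = 1.0         # hour^-1, as assumed in the text
N_REGS = 3000


def mtbf_without_tmr(lam: float = LAMBDA, n: int = N_REGS) -> float:
    """Integral of exp(-n*lam*t) over t >= 0, i.e. 1 / (n * lam)."""
    return 1.0 / (n * lam)


def mtbf_with_tmr(lam: float = LAMBDA, mu: float = MU, n: int = N_REGS) -> float:
    """Dominant-term approximation: R_system ~ exp(-n * slow * t)."""
    s = mu + 5.0 * lam
    w = math.sqrt(mu * mu + 10.0 * lam * mu + lam * lam)
    slow = 12.0 * lam * lam / (s + w)   # slow decay rate of one TMR DFF
    return 1.0 / (n * slow)


print(f"without TMR: {mtbf_without_tmr():.3e} h")
print(f"with TMR:    {mtbf_with_tmr():.3e} h")
```

Under these simplifications the estimate is a few thousand hours without TMR and of the order of $10^9$ hours with TMR, which agrees in order of magnitude with the rounded values in (4) and (5).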

### *B. The Reliability of the Ion Test*

According to the test results, the SEU threshold for the APA chip is about 3 MeV $\cdot$ cm<sup>2</sup>/mg [5]. At the altitude of 500 km, the flux at this LET value is less than 0.5 ions/cm<sup>2</sup>/s [6], [7], [8], which means the total fluence in space over three years is less than $4.7 \times 10^7$ ions/cm<sup>2</sup>. The total fluence in the experiment at HIRFL-TR5 is about $1.5 \times 10^5$ ions/cm<sup>2</sup>, and the LET value is 39.6 MeV $\cdot$ cm<sup>2</sup>/mg. Based on the experiment data of [5], for APA chips the cross section at an LET value of 39.6 MeV $\cdot$ cm<sup>2</sup>/mg is more than 1,000 times as large as that at an LET value of 3 MeV $\cdot$ cm<sup>2</sup>/mg. So the total fluence in the ion test is equivalent to at least $1.5 \times 10^8$ ions/cm<sup>2</sup> at an LET value of 3 MeV $\cdot$ cm<sup>2</sup>/mg, which is more than the expected three-year fluence in space.

TABLE I COMPARISON OF RESOURCE CONSUMPTION AND PERFORMANCE REDUCTION

| Design                       | Logic tile usage (%) | Highest frequency (MHz) | Power (mW) |
|------------------------------|----------------------|-------------------------|------------|
| Without hardening techniques | 57.1                 | 44.1                    | 124.1      |
| With hardening techniques    | 77.0                 | 33.2                    | 153.8      |
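The fluence comparison is simple arithmetic and can be checked directly. The sketch below uses only numbers quoted in the text (flux bound, test flux and duration, and the factor of 1,000 between cross sections); the variable names are illustrative, and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

# Space environment: upper bound on flux at the SEU threshold LET.
space_flux = 0.5                                  # ions/cm^2/s
space_fluence = space_flux * 3 * SECONDS_PER_YEAR  # three-year fluence bound

# Functional test at HIRFL-TR5: 100 ions/cm^2/s for about 25 minutes.
test_fluence = 100 * 25 * 60                      # ions/cm^2

# Cross section at 39.6 MeV*cm^2/mg is more than 1,000x that at 3 MeV*cm^2/mg,
# so 1,000 is used here as a lower bound on the scaling factor.
CROSS_SECTION_RATIO = 1000
equivalent_fluence = test_fluence * CROSS_SECTION_RATIO

print(f"three-year bound in space: {space_fluence:.2e} ions/cm^2")
print(f"test fluence:              {test_fluence:.2e} ions/cm^2")
print(f"equivalent at threshold:   {equivalent_fluence:.2e} ions/cm^2")
print("test exceeds mission bound:", equivalent_fluence > space_fluence)
```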

As discussed above, due to the write-back line in the TMR structure, the SEU error will not accumulate with time. Therefore, the ion test simulates a much harsher condition and the result shows that the FPGA logic with SEU mitigation techniques has SEU tolerance in space environment.

### *C. Resource Consumption and Performance Reduction*

The methods of SEU mitigation have efficiently reduced the sensitivity to SEU of the FPGA. However, as mentioned before, resource consumption and speed degradation also need to be considered when logic hardening techniques are used.

Table I compares the logic tile consumption, highest work frequency, and power of the logic with and without hardening techniques. Due to the SEU mitigation design, the logic tile usage increases from 57.1% to 77.0%, which still leaves enough free resources for automatic place-and-route to complete without any problems. The highest work frequency declines from 44.1 MHz to 33.2 MHz, which still leaves a margin of more than 50% over the work frequency of 20 MHz. The power, which increases from 124.1 mW to 153.8 mW, does not cause a problem either. Therefore, the techniques used in the logic design not only mitigate SEU effectively but also consume an acceptable amount of resources.
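The overheads implied by Table I can be made explicit with a few lines of arithmetic. The sketch below uses only the values in the table and the 20 MHz work frequency; the dictionary layout and names are illustrative.

```python
WORK_FREQ_MHZ = 20.0

baseline = {"tile_pct": 57.1, "fmax_mhz": 44.1, "power_mw": 124.1}
hardened = {"tile_pct": 77.0, "fmax_mhz": 33.2, "power_mw": 153.8}

# Extra logic tiles, in percentage points of the device.
tile_delta = hardened["tile_pct"] - baseline["tile_pct"]

# Timing margin of the hardened design relative to the work frequency.
freq_margin = (hardened["fmax_mhz"] - WORK_FREQ_MHZ) / WORK_FREQ_MHZ

# Relative power increase caused by the hardening logic.
power_increase = (hardened["power_mw"] - baseline["power_mw"]) / baseline["power_mw"]

print(f"logic tiles: +{tile_delta:.1f} percentage points")
print(f"frequency margin over 20 MHz: {freq_margin:.0%}")
print(f"power increase: {power_increase:.1%}")
```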

## VI. CONCLUSION

To enable the APA to function well in space for DAMPE, SEU mitigation techniques including TMR, CRC, and multi-domain reset are used in the logic. The FPGA with logic hardened against SEU was tested in an ion-beam experiment and was shown to work properly in the space radiation environment.

## REFERENCES

- [1] J. Chang *et al.*, "An excess of cosmic ray electrons at energies of 300–800 GeV," *Nature*, vol. 456, no. 7220, pp. 362–365, Nov. 2008.
- [2] J. Chang, "Dark matter particles detection in space," *J. Eng. Studies*, vol. 2, no. 2, pp. 95–99, Jun. 2010.
- [3] J. Chang, "Dark Matter Particle Explorer: The first Chinese cosmic ray and hard  $\gamma$ -ray detector in space," *Chinese J. Space Sci.*, vol. 34, no. 5, pp. 500–557, 2014.
- [4] ProASICPLUS flash family FPGAs datasheet, V5.9, Actel Corp., Aliso Viejo, CA, USA, Dec. 2009.
- [5] G. R. Allen and G. M. Swift, "Single event effects test results for advanced field programmable gate arrays," presented at the 2006 IEEE Radiation Effects Data Workshop, Ponte Vedra, FL, USA, Jul. 2006.
- [6] A. J. Tylka, J. H. Adams, and P. R. Boberg *et al.*, "CREME96: A revision of the cosmic ray effects on micro-electronics code," presented at the 34th Annu. IEEE Int. Nuclear and Space Radiation Effects Conf. (NSREC'97), Snowmass Village, CO, USA, Jul. 1997.
- [7] R. A. Weller, M. H. Mendenhall, R. A. Reed, R. D. Schrimpf, K. M. Warren, B. D. Sierawski, and L. W. Massengill, "Monte Carlo simulation of single event effects," *IEEE Trans. Nucl. Sci.*, vol. 57, no. 4, pp. 1726–1746, Aug. 2010.
- [8] M. H. Mendenhall and R. A. Weller, "A probability-conserving cross-section biasing mechanism for variance reduction in Monte Carlo particle transport calculations," *Nucl. Inst. Meth. A*, vol. 667, pp. 38–43, Mar. 2012.
- [9] D. P. Siewiorek and R. S. Swarz, *The Theory and Practice of Reliable System Design*. Bedford, MA, USA: Digital Press, 1982.
- [10] D. McMurtrey, K. Morgan, B. Pratt, and M. Wirthlin, "Estimating TMR reliability on FPGAs using Markov models," Brigham Young Univ., Dept. Electr. Comput. Eng., Tech. Rep., 2007 [Online]. Available: http://hdl.handle.net/1877/644